Generative Pre-training for Speech with Flow Matching
Generative models have gained increasing attention in recent years for
their remarkable success in tasks that require estimating and sampling a
data distribution to generate high-fidelity synthetic data. In speech,
text-to-speech synthesis and neural vocoders are good examples where
generative models have shone. While generative models have been applied to
different applications in speech, there exists no general-purpose
generative model that models speech directly. In this work, we take a step
in this direction by showing that a single pre-trained generative model can
be adapted to different downstream tasks with strong performance.
Specifically, we pre-train a generative model, named SpeechFlow, on 60k
hours of untranscribed speech with Flow Matching and masked conditions.
Experimental results show that the pre-trained generative model can be
fine-tuned with task-specific data to match or surpass existing expert
models on speech enhancement, separation, and synthesis. Our work suggests
that a foundational model for speech generation tasks can be built with
generative pre-training.
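
To make the training recipe above concrete, here is a minimal sketch of one
flow-matching step with masked conditioning in PyTorch. The model
signature, the masking ratio, and the simple linear noise-to-data path are
illustrative assumptions, not the authors' implementation.

    import torch

    def flow_matching_step(model, x1, mask_ratio=0.7):
        """One conditional flow-matching training step (sketch).

        x1: clean speech features, shape (B, T, D). The model learns a
        vector field v(x_t, t, cond) whose ODE transports noise x0 to
        data x1.
        """
        x0 = torch.randn_like(x1)                    # noise sample
        t = torch.rand(x1.size(0), 1, 1, device=x1.device)  # time in [0, 1]
        xt = (1 - t) * x0 + t * x1                   # linear probability path
        target = x1 - x0                             # constant-velocity target

        # Masked condition: the model sees the clean signal only at
        # unmasked positions, as in the self-supervised pre-training.
        keep = torch.rand(x1.shape[:2], device=x1.device) > mask_ratio
        cond = x1 * keep.unsqueeze(-1)

        pred = model(xt, t, cond)                    # hypothetical signature
        return ((pred - target) ** 2).mean()         # flow-matching loss

One plausible reading of the abstract (not a confirmed detail) is that
fine-tuning then swaps the masked clean signal for a task-specific
condition, e.g. a noisy mixture for enhancement, keeping the same loss.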
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Large-scale generative models such as GPT and DALL-E have revolutionized
natural language processing and computer vision research. These models not
only generate high-fidelity text or image outputs, but are also generalists
that can solve tasks they were not explicitly taught. In contrast, speech
generative models are still primitive in terms of scale and task
generalization. In this paper, we present Voicebox, the most versatile
text-guided generative model for speech at scale. Voicebox is a
non-autoregressive flow-matching model trained to infill speech given audio
context and text, and is trained on over 50K hours of speech that is
neither filtered nor enhanced. Similar to GPT, Voicebox can perform many
different tasks through in-context learning, but it is more flexible
because it can also condition on future context. Voicebox can be used for
monolingual or cross-lingual zero-shot text-to-speech synthesis, noise
removal, content editing, style conversion, and diverse sample generation.
In particular, Voicebox outperforms the state-of-the-art zero-shot TTS
model VALL-E on both intelligibility (word error rate: 5.9% for VALL-E vs
1.9% for Voicebox) and audio similarity (0.580 for VALL-E vs 0.681 for
Voicebox) while being up to 20 times faster. See voicebox.metademolab.com
for a demo of the model.
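
The task generality comes from how the infilling interface is framed at
inference: the same model covers TTS, editing, and denoising depending on
where the mask sits. A sketch under assumed names and shapes, not the
released API:

    import torch

    def make_infill_inputs(audio_ctx, text_tokens, gen_len):
        """Frame zero-shot TTS as speech infilling (sketch).

        audio_ctx:   reference speech features, shape (T_ctx, D)
        text_tokens: transcript of the context plus the target text
        gen_len:     number of frames to synthesize
        """
        masked = torch.zeros(gen_len, audio_ctx.size(-1))  # span to generate
        audio = torch.cat([audio_ctx, masked], dim=0)      # context then mask
        mask = torch.cat([torch.zeros(audio_ctx.size(0), dtype=torch.bool),
                          torch.ones(gen_len, dtype=torch.bool)])
        return audio, mask, text_tokens

Masking a middle span instead yields content editing; because the model is
non-autoregressive, it conditions on audio context on both sides of the
mask, and generation solves the learned flow-matching ODE over the masked
frames only.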
Scaling Speech Technology to 1,000+ Languages
Expanding the language coverage of speech technology has the potential to
improve access to information for many more people. However, current
speech technology is restricted to about one hundred languages, a small
fraction of the more than 7,000 languages spoken around the world. The
Massively Multilingual Speech (MMS) project increases the number of
supported languages by 10-40x, depending on the task. The main ingredients
are a new dataset based on readings of publicly available religious texts
and the effective use of self-supervised learning. We built pre-trained
wav2vec 2.0 models covering 1,406 languages, a single multilingual
automatic speech recognition model for 1,107 languages, speech synthesis
models for the same number of languages, as well as a language
identification model for 4,017 languages. Experiments show that our
multilingual speech recognition model more than halves the word error rate
of Whisper on 54 languages of the FLEURS benchmark while being trained on
a small fraction of the labeled data.
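
The released checkpoints can be used directly. Below is a hedged usage
sketch that loads the multilingual ASR model through Hugging Face
Transformers, assuming the public "facebook/mms-1b-all" distribution with
per-language adapters; the checkpoint name and adapter calls follow the
MMS release documentation, not this abstract.

    import numpy as np
    import torch
    from transformers import AutoProcessor, Wav2Vec2ForCTC

    processor = AutoProcessor.from_pretrained("facebook/mms-1b-all")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")

    # Switch the vocabulary and adapter weights to one of the supported
    # languages via its ISO 639-3 code, e.g. French.
    processor.tokenizer.set_target_lang("fra")
    model.load_adapter("fra")

    audio = np.zeros(16000, dtype=np.float32)  # stand-in for 1 s of 16 kHz speech
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(ids))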
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Transformers achieve remarkable performance on several tasks, but due to
their quadratic complexity with respect to the input's length, they are
prohibitively slow for very long sequences. To address this limitation, we
express self-attention as a linear dot-product of kernel feature maps and
make use of the associativity property of matrix products to reduce the
complexity from O(N^2) to O(N), where N is the sequence length. We show
that this formulation permits an iterative implementation that
dramatically accelerates autoregressive transformers and reveals their
relationship to recurrent neural networks. Our linear transformers achieve
performance similar to vanilla transformers and are up to 4000x faster on
autoregressive prediction of very long sequences.
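
A minimal sketch of the non-causal kernelized attention described above,
using the paper's elu(x) + 1 feature map; associativity lets phi(K)^T V be
computed once, so the cost is linear in sequence length (tensor names and
shapes here are illustrative).

    import torch
    import torch.nn.functional as F

    def linear_attention(Q, K, V, eps=1e-6):
        """Non-causal linear attention (sketch).

        Q, K: (B, N, d), V: (B, N, d_v). With phi(x) = elu(x) + 1,
        attention is phi(Q) (phi(K)^T V) normalized by phi(Q) phi(K)^T 1,
        computed in O(N) instead of O(N^2).
        """
        Q, K = F.elu(Q) + 1, F.elu(K) + 1
        KV = torch.einsum("bnd,bnv->bdv", K, V)   # phi(K)^T V: (B, d, d_v)
        Z = 1.0 / (torch.einsum("bnd,bd->bn", Q, K.sum(dim=1)) + eps)
        return torch.einsum("bnd,bdv,bn->bnv", Q, KV, Z)

The causal, autoregressive variant replaces the global sums with running
prefix sums over positions, carrying KV and the normalizer as recurrent
state; this is exactly the RNN view the abstract refers to.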